Skip to content

fix: correct 4 webapp bugs from issue #3748#3749

Closed
deepshekhardas wants to merge 20 commits into
triggerdotdev:mainfrom
deepshekhardas:fix/3748-logs-presenter-replay-ai-agent
Closed

fix: correct 4 webapp bugs from issue #3748#3749
deepshekhardas wants to merge 20 commits into
triggerdotdev:mainfrom
deepshekhardas:fix/3748-logs-presenter-replay-ai-agent

Conversation

@deepshekhardas
Copy link
Copy Markdown

Fixes 4 bugs identified in issue #3748:

  1. LogsListPresenter: Inverted ClickHouse guard - removed the CLICKHOUSE error block so ClickHouse queries proceed correctly.

  2. ReplayRunDialog: Missing replayDataFetcher.data dependency in useEffect.

  3. AIFilterInput: Changed fetcher state check from loading to idle.

  4. AgentView: Added useEffect to reset refs when sessionId changes.

Deploy Bot and others added 20 commits February 2, 2026 16:16
- Include reproduction scripts for Sentry (triggerdotdev#2900) and engine strictness (triggerdotdev#2913)
- Include PR body drafts for consolidated tracking
- Include reproduction scripts for Sentry (triggerdotdev#2900) and engine strictness (triggerdotdev#2913)
- Include PR body drafts for consolidated tracking
When the underlying logical-replication client errored (e.g. after a
Postgres failover), the runs and sessions replication services logged
the error and left the stream stopped. The host process kept running,
the WAL backed up, and ClickHouse silently fell behind.

Both services now run a configurable recovery strategy on stream errors,
defaulting to in-process reconnect with exponential backoff so a fresh
self-hosted setup heals on its own:

- "reconnect" (default) re-subscribes via the existing subscribe(lastLsn)
  path with exponential backoff (1s -> 60s cap, unlimited attempts), which
  re-validates the publication, re-acquires the leader lock, and resumes
  from the last acknowledged LSN.
- "exit" calls process.exit after a short flush window so a host's
  supervisor (Docker restart=always, systemd, k8s, etc.) can replace the
  process.
- "log" preserves the historical behaviour.

Per-service strategy + exit knobs are env-driven via
RUN_REPLICATION_ERROR_STRATEGY / SESSION_REPLICATION_ERROR_STRATEGY plus
matching *_EXIT_DELAY_MS / *_EXIT_CODE. Reconnect tuning is shared
across both services via REPLICATION_RECONNECT_INITIAL_DELAY_MS /
_MAX_DELAY_MS / _MAX_ATTEMPTS (0 = unlimited).
Addresses PR review feedback:

- LogicalReplicationClient.subscribe() can throw before its internal
  "error" listener is wired up (notably when pg client.connect() fails
  mid-failover). The reconnect strategy's catch block only logged, so
  recovery silently stopped. Now also calls scheduleReconnect(err) — the
  pendingReconnect guard makes it idempotent if an error event was also
  emitted.
- Reject negative values for the new replication-recovery env vars and
  cap exit codes at 255.
- Convert the new ReplicationErrorRecovery{Deps,} interfaces to type
  aliases to match the repo's TypeScript style.
- Tighten the reconnect dep comment to drop a stale "lastAcknowledgedLsn"
  reference (the wrapper-tracked resume LSN is what callers actually pass).
- Restore process.exit after service.shutdown() in the exit-strategy
  test so a delayed exit timer can't terminate the test worker.
LogicalReplicationClient.subscribe() can resolve without throwing or
emitting an "error" event when leader-lock acquisition fails — it just
calls this.stop() and returns. The reconnect callback now checks
isStopped after subscribe() and throws so the recovery handler can
schedule the next attempt instead of silently giving up.
…rough handle()

The previous post-subscribe() isStopped check was always true on the
happy path: subscribe() calls stop() up front (setting _isStopped=true)
and only resets the flag inside the replicationStart event, which fires
asynchronously after subscribe() returns. So the check threw on every
successful reconnect, the catch rescheduled, the next attempt tore down
the just-built client, and the cycle continued — replication briefly
worked between teardowns, which is why the integration test passed.

Replace it with the correct nudge: subscribe to leaderElection and call
the recovery handler on isLeader=false. That's the only subscribe()
exit path that doesn't either throw or emit an "error" event (the other
silent-return paths emit "error" first via createPublication/createSlot
failures).
The previous commit routed leaderElection(false) through handle(), which
under the exit strategy schedules process.exit. In a multi-instance
deployment that turns lost leader election — a normal operational state
— into a restart loop: exit, supervisor restarts, election fails again,
exit, and so on.

Add a dedicated notifyLeaderElectionLost() on ReplicationErrorRecovery
that the reconnect strategy treats as another retry trigger, while
exit and log strategies no-op. Wire the wrapper services through the
new method.
fix(webapp): auto-recover replication services after stream errors
- LogsListPresenter: Remove inverted ClickHouse guard that blocked valid log queries
- ReplayRunDialog: Add missing replayDataFetcher.data useEffect dependency
- AIFilterInput: Fix wrong fetcher state check (loading -> idle)
- AgentView: Reset refs when session changes to prevent stale data bleed
@changeset-bot
Copy link
Copy Markdown

changeset-bot Bot commented May 26, 2026

🦋 Changeset detected

Latest commit: 913f7c7

The changes in this PR will be included in the next version bump.

Not sure what this means? Click here to learn what changesets are.

Click here if you're a maintainer who wants to add another changeset to this PR

@github-actions
Copy link
Copy Markdown
Contributor

Hi @deepshekhardas, thanks for your interest in contributing!

This project requires that pull request authors are vouched, and you are not in the list of vouched users.

This PR will be closed automatically. See https://github.com/triggerdotdev/trigger.dev/blob/main/CONTRIBUTING.md for more details.

@github-actions github-actions Bot closed this May 26, 2026
@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented May 26, 2026

Review Change Stack

Caution

Review failed

The pull request is closed.

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: 03ae44f4-13f7-4d2e-81a3-1b7397f2a96a

📥 Commits

Reviewing files that changed from the base of the PR and between 37eeaa3 and 913f7c7.

📒 Files selected for processing (32)
  • .changeset/fix-console-interceptor-2900.md
  • .changeset/fix-docker-hub-rate-limit-2911.md
  • .changeset/fix-github-install-node-version-2913.md
  • .changeset/fix-orphaned-workers-2909.md
  • .changeset/fix-sentry-oom-2920.md
  • .server-changes/replication-error-recovery.md
  • apps/webapp/app/components/runs/v3/AIFilterInput.tsx
  • apps/webapp/app/components/runs/v3/ReplayRunDialog.tsx
  • apps/webapp/app/components/runs/v3/agent/AgentView.tsx
  • apps/webapp/app/env.server.ts
  • apps/webapp/app/presenters/v3/LogsListPresenter.server.ts
  • apps/webapp/app/services/replicationErrorRecovery.server.ts
  • apps/webapp/app/services/runsReplicationInstance.server.ts
  • apps/webapp/app/services/runsReplicationService.server.ts
  • apps/webapp/app/services/sessionsReplicationInstance.server.ts
  • apps/webapp/app/services/sessionsReplicationService.server.ts
  • apps/webapp/test/runsReplicationService.errorRecovery.test.ts
  • consolidated_pr_body.md
  • packages/cli-v3/src/cli/common.ts
  • packages/cli-v3/src/commands/deploy.ts
  • packages/cli-v3/src/commands/dev.ts
  • packages/cli-v3/src/commands/login.ts
  • packages/cli-v3/src/commands/update.test.ts
  • packages/cli-v3/src/commands/update.ts
  • packages/cli-v3/src/deploy/buildImage.ts
  • packages/cli-v3/src/entryPoints/dev-index-worker.ts
  • packages/cli-v3/src/entryPoints/dev-run-worker.ts
  • packages/cli-v3/src/entryPoints/managed-index-worker.ts
  • packages/cli-v3/src/entryPoints/managed-run-worker.ts
  • packages/cli-v3/src/utilities/sourceMaps.test.ts
  • packages/cli-v3/src/utilities/sourceMaps.ts
  • packages/core/src/v3/consoleInterceptor.ts

Walkthrough

This PR consolidates eight major fixes across the trigger.dev monorepo. The most substantial addition is a replication error-recovery system that automatically recovers from Postgres logical replication failures via configurable strategies (reconnect with exponential backoff, exit for supervisor restart, or no-op logging). The system integrates into both runs and sessions replication services with environment-driven configuration. Additional improvements include a source-maps utility that conditionally enables source-map-support, CLI engine-check control for deployment builds, Docker Hub authentication support to avoid rate limits, SIGINT/SIGTERM signal handlers for dev cleanup, console interceptor enhancements to preserve log chains for other interceptors, and targeted fixes to web app components and presenters.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~75 minutes

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
⚔️ Resolve merge conflicts
  • Resolve merge conflict in branch fix/3748-logs-presenter-replay-ai-agent

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown
Contributor

@devin-ai-integration devin-ai-integration Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Devin Review found 3 potential issues.

View 5 additional findings in Devin Review.

Open in Devin Review

@@ -260,8 +261,7 @@ export async function updateTriggerPackages(
await installDependencies({ cwd: projectPath, silent: true });
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔴 ignoreEngines option added but never passed to installDependencies

The ignoreEngines option is added to UpdateCommandOptions (line 21) and passed from deploy.ts:262 as { ignoreEngines: true }, but the actual installDependencies call at packages/cli-v3/src/commands/update.ts:261 is await installDependencies({ cwd: projectPath, silent: true }) — it never reads options.ignoreEngines, never computes the per-package-manager flag (e.g. --no-engine-strict for npm), and never passes an args array. The tests in update.test.ts expect args: ["--no-engine-strict"] etc., but these tests will fail because the implementation is missing. The changeset claims to fix deployment failures when Node version mismatches exist, but the fix is not actually wired up.

Prompt for agents
The `ignoreEngines` option is declared in UpdateCommandOptions and passed from deploy.ts, but the installDependencies call at update.ts:261 never uses it. The function needs to:
1. Read `options.ignoreEngines` and `packageManager.name` (already detected at line 254).
2. Compute the appropriate engine-ignore flag: `--no-engine-strict` for npm, `--config.engine-strict=false` for pnpm, `--ignore-engines` for yarn.
3. Pass it as `args: [flag]` (or `args: []` when ignoreEngines is false) to the `installDependencies({ cwd: projectPath, silent: true, args: computedArgs })` call.

The test file update.test.ts already documents the expected behavior for each package manager. The implementation just needs to be added between line 254 (detectPackageManager) and line 261 (installDependencies).
Open in Devin Review

Was this helpful? React with 👍 or 👎 to provide feedback.

Comment on lines +98 to +100
case SeverityNumber.INFO:
this.originalConsole.log(...args);
break;
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔴 console.info() delegates to original console.log() instead of console.info() due to shared SeverityNumber

Both log() (packages/core/src/v3/consoleInterceptor.ts:72) and info() (line 76) map to SeverityNumber.INFO. The new switch statement at line 98 has a single case SeverityNumber.INFO branch that always calls this.originalConsole.log(...args). This means console.info() calls are routed to the original console.log instead of the original console.info. Since the stated purpose of this change is to "preserve log chain when other interceptors (like Sentry) are present", and Sentry patches console.info and console.log separately, this partially defeats the fix — Sentry's console.info hook is bypassed.

Suggested approach

Pass the original method name (or the severityText) through to #handleLog so the switch can distinguish log from info. Alternatively, add a dedicated parameter for which original method to delegate to.

Prompt for agents
In consoleInterceptor.ts, both the `log()` and `info()` public methods call `#handleLog` with `SeverityNumber.INFO`. The switch statement in `#handleLog` at line 97-113 routes based on `severityNumber`, so both end up calling `this.originalConsole.log(...)`. To fix this, the switch needs another signal to distinguish log vs info. Options:

1. Pass the target method name (e.g. 'log', 'info') as an additional parameter to `#handleLog`, and use it to pick the right `originalConsole` method.
2. Use `severityText` (already 'Log' vs 'Info') to disambiguate within the INFO case.
3. Give `info()` a different severity number (though semantically they are both INFO).

Option 1 is cleanest. Add a parameter like `originalMethod: 'log' | 'info' | 'warn' | 'error' | 'debug'` to `#handleLog` and use it to index into `this.originalConsole[originalMethod](...args)` when delegating.
Open in Devin Review

Was this helpful? React with 👍 or 👎 to provide feedback.

Comment on lines -218 to -219
if (store === EVENT_STORE_TYPES.CLICKHOUSE) {
throw new ServiceValidationError(
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🚩 LogsListPresenter: Removal of CLICKHOUSE store gate now allows queries against clickhouse_v1

The PR removes the guard at apps/webapp/app/presenters/v3/LogsListPresenter.server.ts:218-222 that threw an error when store === EVENT_STORE_TYPES.CLICKHOUSE. Previously, both POSTGRES and CLICKHOUSE stores were rejected, leaving only CLICKHOUSE_V2 as a valid path. Now CLICKHOUSE (v1) organizations will fall through to the same ClickHouse query builder used by v2. This is presumably intentional (the query builder uses a logsListQueryBuilder() that works against the same underlying table), but if there are schema differences between clickhouse v1 and v2 event storage, this could produce incorrect results for v1 orgs. Worth confirming this is intentional.

Open in Devin Review

Was this helpful? React with 👍 or 👎 to provide feedback.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants